Building a prediction model for Bproperty.com¶

Bproperty is a property solutions provider, for both tenants and owners, in Bangladesh. It caters to those seeking real estate services by offering a platform that enables anyone to buy, rent, or sell property in the country. After an owner requests to put a real estate advertisement on Bproperty's website, Bproperty contacts the owner and sends a representative to scout the property. As a result, no individual can place a listing directly on Bproperty's website, so the probability of fake listings is minimal. In this project, about 50 thousand rent listings on Bproperty.com, covering five cities of Bangladesh, were extracted and analyzed.

The main objective here is to build a model that will predict the rent of an apartment with maximum possible accuracy.

This model can be used by the following parties:

  1. Bproperty.com: This predictive model will be most beneficial for bproperty.com itself. By feeding in the information of apartments that were successfully rented out, they can easily predict the optimum rent for a listing given its number of bedrooms, baths, and features, and the city the apartment is situated in, and then provide better consultations to owners. Moreover, the same model can be deployed for other types of properties as well.
  2. Individual real estate owners: Individuals owning an apartment and willing to rent it out can benefit from this model, since a property gets rented in a free, competitive marketplace only if it is offered at or below the competitive market price. If the property is rented for less than the market price, the owner loses money; on the other hand, if the asking rent is above the market price, people will not pay for the property.
  3. Tenants: Like landlords, tenants can benefit from this model as well. If an individual wants to rent a house or flat in a particular location with their desired number of bedrooms and bathrooms and area, they can easily predict the rent using this model.
  4. Real estate companies: Real estate companies willing to undertake projects in a particular location can easily predict the rent given their specifications. This will help the companies evaluate a project more accurately before investing.

Importing necessary libraries and Data Collection¶

In [2]:
import pandas as pd
from matplotlib import pyplot as plt
from google.colab import drive
from bs4 import BeautifulSoup
import requests

For this project, only the apartments-for-rent listings from Bproperty will be used. The same code can be replicated to collect data for the other property types on bproperty.com: room, duplex, plaza, building, plot, office, shop, etc.

Bproperty has almost 45-50 thousand active apartment listings to date, and details of 24 listings are shown per page, so there should be about 2,084 pages of apartment listings. Page numbers differentiate the URLs: page 24 has the URL main url (https://www.bproperty.com/en/bangladesh/apartments-for-rent/) + "page-" + page number (24) --> https://www.bproperty.com/en/bangladesh/apartments-for-rent/page-24/. So a loop has been created to visit the 2,000+ webpages and collect the link to each apartment's details page.

In [ ]:
listings_links = []

#the page number will be attached to main_url to build the specific url for each webpage
main_url = "https://www.bproperty.com/en/bangladesh/apartments-for-rent/"

#loop to visit the webpages, each of which contains links to 24 apartment detail pages
for page_number in range(1, 2085):

  #building the url for this page before requesting it
  #(the original version updated the url at the end of the loop, which scraped page 1 twice)
  url = main_url if page_number == 1 else main_url + "page-" + str(page_number) + "/"

  #getting HTML of a particular webpage in text format
  raw_data = requests.get(url).text

  #parsing the HTML collected from the webpage
  soup = BeautifulSoup(raw_data, 'html5lib')

  #each webpage holds short profiles of 24 apartment listings with a link to each one's details.
  #the required links are saved in "a" tags with class "_287661cb"
  #collecting all the available links within the mentioned class and tag
  for link in soup.find_all("a", class_ = "_287661cb", href = True):
    listings_links.append(link['href'])
In [ ]:
#mounting google drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive
In [ ]:
#saving the urls in drive
filename = "links.csv"

#converting the list, containing the desired links, into dataframe and saving it
links_csv = pd.DataFrame(listings_links)
links_csv.to_csv(filename)

# Path for my google drive Folder
!cp $filename '/content/gdrive/My Drive/data science projects/'
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).

The list should contain almost 50,000 links to apartment detail pages. The details are the number of bedrooms, number of baths, size measured in sqft, rent, address (city, neighborhood, area), and features. For simplicity of analysis, only the total number of features will be extracted instead of the full feature details.

N.B.: the links collected in the last section contain only the path fragment of each URL. So the main url will be used as the base, and the fragments will be attached to it inside the loop.

In [ ]:
#creating a dataframe with the columns for the variables available and required for this project from details page
bproperty_data = pd.DataFrame(columns=["bedrooms",
                                       "baths",
                                       "sqft",
                                       "rent",
                                       "address",
                                       "added_date",
                                       "features",
                                       "link"])

#loading the links to the detail pages of all the apartments from drive (in case a new Colab session has started)
#listings_links = pd.read_csv("/content/gdrive/My Drive/data science projects/links.csv")['0']
#print(listings_links[0:5])

#the base url
main_url = "https://www.bproperty.com"

#looping over each available link and extracting the desired datapoints
for url_fragment in listings_links[bproperty_data.shape[0]:len(listings_links)]:

  #attaching the fragmented link with the main url
  url = main_url + url_fragment

  #fetching and parsing the HTML
  scraped_data = requests.get(url).text
  soup = BeautifulSoup(scraped_data, "html5lib")

  #extracting datapoints as text from identified tag, class, and position from parsed HTML
  bedrooms = soup.find_all("span", class_ = "fc2d1086")[0].text
  baths = soup.find_all("span", class_ = "fc2d1086")[1].text
  sqft = soup.find_all("span", class_ = "fc2d1086")[2].find("span").text
  address = soup.find("div", class_ = "_1f0f1758").text
  added_date = soup.find_all("span", class_ = "_812aa185")[3].text
  features = len(soup.find_all("span", class_ = "_005a682a"))
  rent = soup.find("span", class_ = "_105b8a67").text

  #attaching the datapoints to the relevant columns of the main dataframe
  #(note: DataFrame.append was removed in pandas 2.0)
  bproperty_data = bproperty_data.append({"bedrooms": bedrooms,
                                          "baths": baths,
                                          "sqft": sqft,
                                          "rent": rent,
                                          "address": address,
                                          "added_date": added_date,
                                          "features": features,
                                          "link": url},
                                         ignore_index=True)
  print(bproperty_data.shape)
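Newer pandas versions (2.0+) removed `DataFrame.append`; a minimal sketch of the now-idiomatic pattern, accumulating rows in a plain list and building the DataFrame once (the two stand-in rows below are illustrative, not scraped data):

```python
import pandas as pd

#collect each scraped row as a dict in a plain list...
rows = []
for bedrooms, baths in [("2 Beds", "2 Baths"), ("3 Beds", "3 Baths")]:  #stand-in for the scraping loop
    rows.append({"bedrooms": bedrooms, "baths": baths})

#...and build the dataframe in one go at the end, which is also much faster
#than growing it row by row with append
bproperty_data = pd.DataFrame(rows, columns=["bedrooms", "baths"])
print(bproperty_data.shape)  # (2, 2)
```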
In [ ]:
#saving the scraped data in drive
#mounting google drive
drive.mount('/content/gdrive')

# Path for my google drive Folder
filename = "bproperty_data.csv"

bproperty_data.to_csv(filename)

!cp $filename '/content/gdrive/My Drive/data science projects/'
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
In [ ]:
#loading the main dataset from drive (in case a new Colab session has started)
df = pd.read_csv("/content/gdrive/My Drive/data science projects/predicting house rents/bproperty_data.csv")
In [6]:
df = pd.read_csv("bproperty_data.csv")

Cleaning the dataset¶

In [7]:
df.head()
Out[7]:
Unnamed: 0 bedrooms baths sqft rent address added_date features link
0 0 2 Beds 2 Baths 1,000 sqft 12,500 Darjiban Mosque Road, Dargi Para, Sylhet 14-Nov-21 6 https://www.bproperty.com/en/property/details-...
1 1 5 Beds 3 Baths 1,800 sqft 20,000 Mojumdarpara Road, Mazumder Para, Sylhet 14-Nov-21 7 https://www.bproperty.com/en/property/details-...
2 2 2 Beds 2 Baths 900 sqft 12,000 Masjid Lane Society, Sholokbahar, Chattogram 14-Nov-21 8 https://www.bproperty.com/en/property/details-...
3 3 2 Beds 2 Baths 900 sqft 13,000 Masjid Lane Society, Sholokbahar, Chattogram 14-Nov-21 8 https://www.bproperty.com/en/property/details-...
4 4 2 Beds 2 Baths 1,000 sqft 14,000 Uddipon R/A, Mira Bazar, Sylhet 14-Nov-21 6 https://www.bproperty.com/en/property/details-...

The address column describes the city, neighborhood, and area name. It can be split to find the city and neighborhood names, as they are separated by commas. The first part of the address is the area name; the second part, after the first comma, is the neighborhood name; and the third part, after the second comma, is the city name. Besides working with the address column, three columns will be dropped: "Unnamed: 0", "link", and "added_date".
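On a single sample address the comma split works like this (a small sketch; the address string is taken from the table above):

```python
import pandas as pd

address = pd.Series(["Darjiban Mosque Road, Dargi Para, Sylhet"])

#expand=True spreads the comma-separated parts into columns 0, 1, 2
parts = address.str.split(",", expand=True)

#note the leading space each part keeps after the split; strip it off
area = parts[0].str.strip()
neighborhood = parts[1].str.strip()
city = parts[2].str.strip()
print(city[0])  # Sylhet
```

The leading spaces are why values like " Badda" (with a space) must be matched later when fixing the city column.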

In [8]:
#Dropping the mentioned columns
df_cleaned = df.drop(labels = ["Unnamed: 0", "link", "added_date"], axis = 1)
df_cleaned.head()
Out[8]:
bedrooms baths sqft rent address features
0 2 Beds 2 Baths 1,000 sqft 12,500 Darjiban Mosque Road, Dargi Para, Sylhet 6
1 5 Beds 3 Baths 1,800 sqft 20,000 Mojumdarpara Road, Mazumder Para, Sylhet 7
2 2 Beds 2 Baths 900 sqft 12,000 Masjid Lane Society, Sholokbahar, Chattogram 8
3 2 Beds 2 Baths 900 sqft 13,000 Masjid Lane Society, Sholokbahar, Chattogram 8
4 2 Beds 2 Baths 1,000 sqft 14,000 Uddipon R/A, Mira Bazar, Sylhet 6
In [9]:
#removing string values from datapoints of bedrooms and baths column
df_cleaned["bedrooms"] = df_cleaned["bedrooms"].str.split(" ", expand = True)[0]
df_cleaned["baths"] = df_cleaned["baths"].str.split(" ", expand = True)[0]
df_cleaned["sqft"] = df_cleaned["sqft"].str.split(" ", expand = True)[0]
In [10]:
#checking whether the bedrooms column still has any string values after cleaning
df_cleaned['bedrooms'].value_counts().to_frame()
Out[10]:
count
bedrooms
2 27138
3 18745
1 2635
4 1289
5 34
6 5
7 1
Studio 1
In [11]:
#removing the only string value "Studio" from the bedrooms column
df_cleaned = df_cleaned[df_cleaned.bedrooms != "Studio"]
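An alternative to dropping the row, sketched below, is to keep the single studio listing by mapping the label to a numeric stand-in; treating a studio as 0 bedrooms is an arbitrary convention assumed here, not something the dataset defines:

```python
import pandas as pd

df = pd.DataFrame({"bedrooms": ["2", "Studio", "3"]})

#map the one string label to "0" so the whole column can become numeric
df["bedrooms"] = df["bedrooms"].replace("Studio", "0").astype(float)
print(df["bedrooms"].tolist())  # [2.0, 0.0, 3.0]
```

With only one studio in ~50,000 rows, dropping it (as done above) costs essentially nothing either way.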
In [12]:
#checking whether the baths column still has any string values after cleaning
df_cleaned["baths"].value_counts().to_frame()
Out[12]:
count
baths
2 26753
3 11757
1 8761
4 2376
5 199
6 1
In [13]:
#removing the comma from numerical values in column "sqft" and "rent"
#converting the columns into float type
df_cleaned["rent"] = df_cleaned["rent"].str.replace(",", "").astype("float")
df_cleaned["sqft"] = df_cleaned["sqft"].str.replace(",", "").astype("float")

df_cleaned["baths"] = df_cleaned["baths"].astype("float")
df_cleaned["bedrooms"] = df_cleaned["bedrooms"].astype("float")
In [14]:
#extracting the city and neighborhood names from the comma separated address column
df_cleaned["city"] = df_cleaned["address"].str.split(",", expand = True)[2]
df_cleaned["neighborhood"] = df_cleaned["address"].str.split(",", expand = True)[1]

#dropping the address column
df_cleaned = df_cleaned.drop(labels = ["address"], axis = 1)
In [15]:
#getting the frequencies of city column
df_cleaned['city'].value_counts().to_frame()
Out[15]:
count
city
Dhaka 37345
Chattogram 8681
Gazipur 2674
Sylhet 338
Cumilla 298
Badda 243
Malibagh 8
Mira Bazar 2
Subid Bazar 1
Kakrail 1
Khilgaon 1

The city column has the names of these cities: Dhaka, Chattogram, Gazipur, Sylhet, and Cumilla. But the values Badda, Malibagh, Mira Bazar, Khilgaon, Kakrail, and Subid Bazar are not city names. To identify what caused these values to appear in the city column, the address column of df will be reviewed by inspecting all the columns generated from the comma split.

In [16]:
#taking in all columns to identify aforementioned issue
com_add = df["address"].str.split(",", expand = True)
com_add.head()
Out[16]:
0 1 2 3 4
0 Darjiban Mosque Road Dargi Para Sylhet None None
1 Mojumdarpara Road Mazumder Para Sylhet None None
2 Masjid Lane Society Sholokbahar Chattogram None None
3 Masjid Lane Society Sholokbahar Chattogram None None
4 Uddipon R/A Mira Bazar Sylhet None None

Here it is visible that a few rows had more than two commas, pushing the neighborhood name into the third part (column 2 of the table above) and the city name into column 3.

In [17]:
#crosschecking the hypothesis that city names went to column 3 by observing Badda's rows, since it had the most faulty occurrences
com_add[com_add.iloc[:, 2] == " Badda"]
Out[17]:
0 1 2 3 4
552 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
713 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
4115 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
4117 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
4148 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
... ... ... ... ... ...
49687 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
49699 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
49825 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
49829 South Baridhara Residential Area D. I. T. Project Badda Dhaka None
49842 South Baridhara Residential Area D. I. T. Project Badda Dhaka None

243 rows × 5 columns

Instead of just replacing the faulty occurrences with the most frequent city in the whole dataset, the neighborhood groups were filtered to find the most frequent city for each neighborhood (as multiple cities can have neighborhoods with the same name), and each faulty value was then replaced with the city found for its neighborhood.

In [18]:
#replacing each faulty neighborhood name in the city column with the most frequent city name for that neighborhood
city_replace = [" Badda", " Malibagh", " Mira Bazar", " Khilgaon", " Kakrail", " Subid Bazar"]
for i in city_replace:
  replacement = df_cleaned[df_cleaned.neighborhood == i]["city"].value_counts().idxmax()
  print(i, "will be replaced with", replacement)
  df_cleaned["city"].replace(i, replacement, inplace = True)
 Badda will be replaced with  Dhaka
 Malibagh will be replaced with  Dhaka
 Mira Bazar will be replaced with  Sylhet
 Khilgaon will be replaced with  Dhaka
 Kakrail will be replaced with  Dhaka
 Subid Bazar will be replaced with  Sylhet
In [19]:
#checking if the city column looks alright now
df_cleaned["city"].value_counts().to_frame()
Out[19]:
count
city
Dhaka 37598
Chattogram 8681
Gazipur 2674
Sylhet 341
Cumilla 298
In [20]:
df_cleaned.head()
Out[20]:
bedrooms baths sqft rent features city neighborhood
0 2.0 2.0 1000.0 12500.0 6 Sylhet Dargi Para
1 5.0 3.0 1800.0 20000.0 7 Sylhet Mazumder Para
2 2.0 2.0 900.0 12000.0 8 Chattogram Sholokbahar
3 2.0 2.0 900.0 13000.0 8 Chattogram Sholokbahar
4 2.0 2.0 1000.0 14000.0 6 Sylhet Mira Bazar
In [21]:
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Index: 49847 entries, 0 to 49847
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   bedrooms      49847 non-null  float64
 1   baths         49847 non-null  float64
 2   sqft          49847 non-null  float64
 3   rent          49847 non-null  float64
 4   features      49847 non-null  int64  
 5   city          49592 non-null  object 
 6   neighborhood  49841 non-null  object 
dtypes: float64(4), int64(1), object(2)
memory usage: 3.0+ MB

As it is mandatory to fill in all the information before posting a rental on Bproperty, the numeric columns have no missing values. Only the city and neighborhood columns, which were derived from the comma split of address, have a small number of missing entries (visible in the non-null counts above).
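One option for the few rows the comma split left without a city is to drop them before modeling; a minimal sketch on toy rows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"rent": [12500.0, 20000.0, 12000.0],
                   "city": ["Sylhet", np.nan, "Chattogram"]})

#drop the handful of rows whose address did not yield a city
df = df.dropna(subset=["city"])
print(len(df))  # 2
```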

Delving into the data¶

In [22]:
df_cleaned.describe()
Out[22]:
bedrooms baths sqft rent features
count 49847.000000 49847.000000 49847.000000 4.984700e+04 49847.000000
mean 2.377455 2.167493 974.394848 1.701064e+04 11.761510
std 0.630975 0.781340 415.955743 2.562665e+04 6.876985
min 1.000000 1.000000 120.000000 2.500000e+03 0.000000
25% 2.000000 2.000000 700.000000 1.050000e+04 7.000000
50% 2.000000 2.000000 850.000000 1.400000e+04 8.000000
75% 3.000000 3.000000 1200.000000 1.800000e+04 19.000000
max 7.000000 6.000000 7000.000000 4.530000e+06 46.000000

The description shows that the mean numbers of bedrooms and baths on bproperty.com are between 2 and 3, while the mean area of the listings is 974 sqft, going up to 7,000 sqft. Moreover, the rent column's mean (about 17,011) is higher than its median (14,000), indicating that the majority of listings have a low asking rent while outliers on the right tail pull the mean upward.
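The mean-above-median pattern is the signature of a right-skewed distribution; a toy series shows how a few expensive listings pull the mean up while leaving the median untouched:

```python
import pandas as pd

#four typical rents plus one outlier on the right tail
rent = pd.Series([10000, 12000, 14000, 15000, 120000])
print(rent.median())  # 14000.0
print(rent.mean())    # 34200.0
```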

In [23]:
df_cleaned["neighborhood"].value_counts().to_frame()
Out[23]:
count
neighborhood
Mirpur 8570
Mohammadpur 4817
Gazipur Sadar Upazila 2661
Uttara 2387
Jatra Bari 1666
... ...
Banglamotors 1
South Khulsi 1
Aditya Para 1
Taltala 1
Goran 1

188 rows × 1 columns

From the frequencies, it is evident that Mirpur, Mohammadpur, Gazipur Sadar Upazila, Uttara, and Jatra Bari are the top 5 neighborhoods by number of rent listings. The total number of neighborhoods listed on Bproperty across the 5 cities combined is 188.

In [24]:
#importing necessary libraries for visualizations
import plotly.express as px
import seaborn as sns
In [25]:
fig = px.scatter(
    data_frame=df_cleaned[(df_cleaned['sqft'] <= 5000)], #to remove extreme outliers for a better view at the relation
    x="sqft",
    y="rent",
    size="bedrooms",
    color="baths",
    hover_name="city",
    size_max=20,
    opacity = 0.6
)
fig.show()

The scatterplot above shows the relationship of the number of bathrooms, the number of bedrooms, and the size in sqft of a house or flat with its rent. The chart depicts the number of bedrooms as the size of each bubble. The average bubble size visibly increases from left to right, indicating a strong relationship between apartment size and number of bedrooms; the relationship between rent and number of bedrooms, however, is not distinguishable by eye.

The same holds for the number of bathrooms, indicating a weak rent-baths relationship.

However, a slight linear relationship is visible between the apartments' area in sqft and their rent.

In [26]:
#printing correlations
print(df_cleaned[['rent', 'sqft']].corr())
print(df_cleaned[['rent', 'baths']].corr())
print(df_cleaned[['rent', 'bedrooms']].corr())
print(df_cleaned[['rent', 'features']].corr())
          rent      sqft
rent  1.000000  0.457137
sqft  0.457137  1.000000
           rent     baths
rent   1.000000  0.316865
baths  0.316865  1.000000
              rent  bedrooms
rent      1.000000  0.285499
bedrooms  0.285499  1.000000
              rent  features
rent      1.000000  0.212289
features  0.212289  1.000000

The correlations match the visualization: only the correlation between area in sqft and rent (about 0.46) is considerable.
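The four separate calls can also be collapsed into a single correlation matrix and the "rent" row read off; a sketch on toy data (the values below are illustrative, not the notebook's):

```python
import pandas as pd

df = pd.DataFrame({"rent":  [10.0, 12.0, 15.0, 20.0],
                   "sqft":  [700.0, 850.0, 1000.0, 1400.0],
                   "baths": [1.0, 2.0, 2.0, 3.0]})

#one .corr() call gives every pairwise correlation; the "rent" row holds
#all the numbers the separate calls would have produced
corr = df.corr()
print(corr.loc["rent"])
```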

In [27]:
#plotting the rent in boxplots grouped by the city column
fig = px.box(df_cleaned[df_cleaned["rent"] < 100000], x="city", y="rent")
fig.show()

According to the boxplots, Dhaka has the highest median rent, 15k, among the 5 cities. This was predictable, as Dhaka is the most populous of these cities (hence land is scarce). After Dhaka come Chattogram, Sylhet, and Cumilla, with median asking rents hovering between 11k and 12k. Gazipur has the lowest median asking rent at 9k. So the typical rent clearly varies from city to city, and this factor could be significant in predicting rent. Apart from the medians, the boxplots also show a considerable number of high outliers in Dhaka's and Chattogram's rent distributions.

Building Machine Learning Model¶

Data Preparation¶

To predict rents, the random forest machine learning technique will be used. Since the city column is a string and scikit-learn's random forest cannot consume a categorical variable directly, this variable will be replaced with dummy variables before being passed into the random forest algorithm.

In [28]:
#importing the preprocessing module from sklearn (LabelEncoder is an alternative encoder; pd.get_dummies is what is actually used below)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
In [29]:
#creating a new dataframe named "feature" holding bedrooms, baths, sqft, features, and dummy variables built from the city column
catg_vars = ["city", "neighborhood", "rent"]
num_vars = df_cleaned.drop(labels = catg_vars, axis = 1)
num_vars.head()
feature = pd.concat([num_vars, pd.get_dummies(df_cleaned["city"])], axis = 1)
In [30]:
feature.head()
Out[30]:
bedrooms baths sqft features Chattogram Cumilla Dhaka Gazipur Sylhet
0 2.0 2.0 1000.0 6 False False False False True
1 5.0 3.0 1800.0 7 False False False False True
2 2.0 2.0 900.0 8 True False False False False
3 2.0 2.0 900.0 8 True False False False False
4 2.0 2.0 1000.0 6 False False False False True
In [31]:
#separating the dependent and independent variables.
#dependent variable
y = df_cleaned['rent'].values
print(y)

#independent variables
x = feature.values
x
[12500. 20000. 12000. ...  9000. 25000. 13000.]
Out[31]:
array([[2.0, 2.0, 1000.0, ..., False, False, True],
       [5.0, 3.0, 1800.0, ..., False, False, True],
       [2.0, 2.0, 900.0, ..., False, False, False],
       ...,
       [1.0, 1.0, 500.0, ..., True, False, False],
       [3.0, 3.0, 1580.0, ..., False, False, False],
       [3.0, 2.0, 900.0, ..., True, False, False]], dtype=object)

Using the train_test_split function, the dataset will be split into two parts: a training set and a testing set. The random forest model will be optimized using the training set, and then evaluated on data that is completely new to the model: the testing set. random_state fixes the split, meaning only one random split is generated and reused throughout while optimizing the model. After model finalization it will no longer be used, so that the model can be evaluated on a truly random split. The dataset is split into training and test data in a 7:3 ratio.

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=11)
In [33]:
#importing random forest module
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

#fitting the training datasets in the model
model.fit(X_train, y_train)
Out[33]:
RandomForestRegressor()

The model has been built from the training dataset using the random forest technique. Now it will be applied to the test dataset to predict the rent. X_test includes the independent variables: sqft, city dummies, number of bathrooms, number of features, and number of bedrooms. These variables will be fed into the model to predict the rent of the testing dataset.

In [34]:
y_pred = model.predict(X_test)

Model Evaluation¶

Now that rent has been predicted, the predictions will be used to evaluate the model's accuracy. For evaluation, the statistical measure R-squared will be applied. R-squared measures what proportion of the dependent variable's variation is explained by the model using the given independent variables.
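R-squared follows directly from its definition, 1 − SS_res/SS_tot, which is what sklearn's r2_score computes; a worked toy example:

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0, 20.0])
y_hat = np.array([11.0, 12.0, 13.0, 19.0])

#residual sum of squares: the variation the model fails to explain
ss_res = np.sum((y_true - y_hat) ** 2)          # 3.0
#total sum of squares: the variation around the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 56.0

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.9464
```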

In [35]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred)
Out[35]:
0.7272832643918732

Here, the R-squared score is about 0.7273, which means roughly 72.7% of the variation of the dependent variable is predicted, or explained, by the model that was developed.
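R-squared alone says nothing about the size of the errors in Taka; MAE and RMSE report them on the rent scale itself. A sketch with stand-in numbers (not the notebook's actual predictions):

```python
import numpy as np

y_test = np.array([12500.0, 20000.0, 14000.0])
y_pred = np.array([13000.0, 18500.0, 14200.0])

#mean absolute error: the average miss, in Taka
mae = np.mean(np.abs(y_test - y_pred))
#root mean squared error: penalizes large misses more heavily
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(round(mae, 2), round(rmse, 2))
```

sklearn.metrics offers the same quantities via mean_absolute_error and mean_squared_error.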

Actual values vs predicted values presented graphically¶

In [38]:
width = 10
height = 8
plt.figure(figsize=(width, height))

#sns.distplot is deprecated; kdeplot draws the same density curve
ax1 = sns.kdeplot(y_pred, color="r")
ax2 = sns.kdeplot(y_test, color="b", ax=ax1)

plt.title('Distribution plot of predicted values by the model and actual values from test dataset')
plt.xlabel('Rent (in Taka)')
plt.ylabel('Proportion of listings')

plt.show()
plt.close()

The graph shows that the model captured most of the variation, except in the heavily concentrated lower segment of the rent distribution, where the predicted density diverges from the actual one.